Abstract


The problem of sorting n integers from a restricted range [1..m], where m is
superpolynomial in n, is considered. An o(n log n) randomized algorithm is given.
Our algorithm takes O(n log log m) expected time and O(n) space. (Thus, for
$m = n^{\mathrm{polylog}(n)}$ we have an O(n log log n) algorithm.) The algorithm is
parallelizable. The resulting parallel algorithm achieves optimal speed up. Some
features of the algorithm make us believe that it is relevant for practical
applications.

A result of independent interest is a parallel hashing technique. The expected
construction time is logarithmic using an optimal number of processors, and searching
for a value takes O(1) time in the worst case. This technique enables drastic reduction
of space requirements for the price of using randomness. Applicability of the technique
is demonstrated for the parallel sorting algorithm, and for some parallel string matching
algorithms.

The parallel sorting algorithm is designed for a strong and non standard model of
parallel computation. Efficient simulations of the strong model on a CRCW PRAM are
introduced. One of the simulations even achieves optimal speed up. This is probably
a first optimal speed up simulation of a certain kind.


1  Introduction 


Consider the problem of sorting n integers drawn from a given range [1..m]. A new
randomized algorithm for the problem is presented. Its expected running time is
O(n log log m) using O(n) space. The algorithm is parallelizable. The resulting parallel
algorithm achieves optimal speed up. The result implies o(n log n) expected time and
linear space for $m < 2^{n^{o(1)}}$. More specifically, for $m = n^{\log^k n}$ (for any
constant k > 1) we have O(n log log n) expected time and O(n) space. No such result is
known for deterministic sorting, suggesting the following fundamental open problem: Is
this an instance where randomization defeats determinism for sorting? The algorithm
seems to be practical.

*Partially supported by NSF grant CCR-890G49 and ONR grant N000M-85-0046.

The paper employs two algorithmic techniques:

1. A first randomized parallel hashing technique, which achieves optimal speed up and takes
expected logarithmic time, is presented. The parallel hashing technique enables drastic
reduction of space requirements in quite a few parallel algorithms for the price of using
randomness. The technique is used in the parallel sorting algorithm. A serial version of
the technique is also demonstrated in the serial sorting algorithm. The new parallel
hashing technique, which results in trading space for randomness, is likely to have
additional applications. We demonstrate it with a few examples.

2. The parallel sorting algorithm is designed for a strong and non standard model of parallel
computation. New simulations of the strong model on a CRCW PRAM are introduced.
Using one of the simulations, the parallel sorting algorithm runs in optimal speed up also on
a CRCW PRAM. Designing algorithms for strong and non standard models of computation
and then translating them into standard models is a traditional methodology in computer
science. We expect our parallel simulations to be helpful in this respect. The simulations
are efficient: one of them even preserves optimal speed up.


1.1  Extant  work 

Sorting is a fundamental problem that has received much attention. [Knu73] gives several
algorithms for sorting n objects drawn from an arbitrary totally ordered domain in O(n log n)
time. There are also optimal parallel sorting algorithms in logarithmic time [AKS83] [Col86].
For the decision-tree model the O(n log n) time serial upper bound is best possible [AHU74].

Because of the central role that the sorting problem plays in computer science, numerous
papers are devoted to studying opportunities for improving this time bound to o(n log n). One
approach is to consider idealized (and non standard) versions of the RAM model; as, for
instance, in [KR84] and [PS80], where very large words are assumed. The practicality of
such an assumption is unclear. Another approach is to focus on instances of the sorting
problem, where the input consists of integers drawn from a restricted interval [1..m]. For
m = O(n) the known Bucket Sort algorithm applies. It solves the problem in O(n) time.
For m = poly(n)¹ the variant of the Bucket Sort algorithm, called Radix Sort, runs in O(n)
time [Knu73]. More precisely, Radix Sort runs in O(kn) time for $m = n^k$. Thus, a natural
extension of the Radix Sort would result in an o(n log n) time algorithm for
$m < n^{o(\log n)}$. However, for $m = n^{\Omega(\log n)}$ Radix Sort does not improve on
the O(n log n) time bound.

¹We use poly(n) to denote "polynomial in n", and polylog(n) to denote "polynomial in log n".
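The O(kn) bound for Radix Sort can be illustrated with a minimal sketch. It assumes keys in [0, n^k) rather than [1..m], and the function name and parameters are hypothetical, chosen only for this illustration. Each of the k passes is a stable bucket pass over base-n digits and costs O(n) when the list has about n keys.

```python
def radix_sort(xs, base, k):
    """Sort integers in [0, base**k) using k stable bucket passes.

    With base = n (the number of keys) this is the O(kn) bound for m = n^k.
    """
    for d in range(k):                      # least significant digit first
        div = base ** d
        buckets = [[] for _ in range(base)]
        for x in xs:                        # stable: keeps earlier order in a bucket
            buckets[(x // div) % base].append(x)
        xs = [x for b in buckets for x in b]
    return xs
```

Once k grows like log n, the k passes cost as much as a comparison sort, which is the point made in the text.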

The second approach was studied for parallel computation as well. Rajasekaran and
Reif gave an optimal randomized parallel algorithm in logarithmic time on an
arbitrary-CRCW for $m = n \log^c n$, for any constant c > 1 [RR89]. The integer sorting
algorithm of Rajasekaran and Reif cannot be extended for m polynomial in n. For
m = poly(n), Hagerup provided an O(log n) time and $O(n^{1+\epsilon})$ space (for any
fixed $\epsilon > 0$) parallel algorithm, using n log log n / log n priority-CRCW
processors [Hag87]. No optimal parallel algorithm is known for this range. Our sorting
algorithm clearly belongs in the second approach.

Using data structures presented by van Emde Boas, Kaas and Zijlstra [vEBKZ77], Johnson
[Joh82] dealt with priority queue problems, where the priorities are drawn from the
integer domain [1..m]. A corollary of his result is an O(n log log m) time and
$O(m^{1/c})$ space algorithm for sorting, where c > 0 is a constant. Johnson recognizes
the problem with the space requirements of the algorithm, and writes that the algorithm is
not practical and only of theoretical interest.

Kirkpatrick and Reisch [KR84] presented an algorithm, based on a range reduction
technique, that has the same complexity bounds as Johnson's algorithm. They state that the
algorithm is of little practical value due to both the large constants hidden in the
asymptotic bounds and the storage requirements.

The  following  open  question,  quoted  from  Kirkpatrick  and  Reisch  ([KR84]),  captures 
an  important  aspect  of  our  sorting  results:  “For  what  ranges  of  inputs  can  we  construct 
practical  o(nlogn)  integer  sorting  algorithms?”.  The  present  paper  provides  only  partial 
answers  to  this  question,  and  more  work  is  still  needed  in  order  to  resolve  this  question. 

The  issue  of  trading  space  for  randomness  using  random  hash  functions  belongs  in  the 
folklore  of  serial  algorithms.  For  instance,  the  survey  paper  [GG88]  demonstrates  such 
considerations  for  hashing  large  alphabets  for  string  algorithms  such  as  the  suffix  tree  data 
structure.  However,  for  parallel  algorithms  this  survey  mentions  only  deterministic  methods. 
This  immediately  suggests  an  implicit  open  problem. 

Recently, [DadH89] described a dynamic data structure (dictionary) that, using
randomization, supports the instructions insert, delete, and lookup, and that can be
implemented in parallel. Time bounds of the form $O(n^{\epsilon})$ using
$< n^{1-\epsilon}$ processors, for some fixed $\epsilon > 0$, are given. However, no time
bounds of the form O(polylog(n)) are claimed.

Several earlier works study relations between PRAM models. The interested reader is
referred to the surveys of [EG88] [KR88] [Vis83]. Randomization was previously used in
the context of parallel simulations by [KU86] [KRS88] [MV84] [Ran87].


1.2  More  on  our  results 


As the model of computation for the parallel algorithms, we use mostly the concurrent-read
concurrent-write parallel random access machine (CRCW PRAM) family. The members
of this family differ in the outcome of the event where several processors attempt to write
simultaneously into the same shared memory location. In the common-CRCW all these
processors must attempt to write the same value (and this value is written). In the
arbitrary-CRCW one of the processors succeeds, but we do not know in advance which one.
In the priority-CRCW the smallest numbered among the processors succeeds. The above three
CRCW models are considered standard. Next we mention two non standard models. In the
min-CRCW PRAM the processor that tries to write the minimum value succeeds. In the
fetch&add-CRCW PRAM the values are added to the value already written in the shared
memory location and all sums obtained in the (virtual) serial process are recorded. Finally,
in an exclusive-read exclusive-write (EREW) PRAM simultaneous access of more than one
processor into the same shared memory location is not allowed.
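The write-conflict rules above can be summarized in a small sketch that resolves the attempts aimed at a single memory cell in one step. The function `resolve_write` and its argument layout are hypothetical, introduced only for illustration. The fetch&add branch returns only the final cell value and does not record the intermediate sums mentioned in the text.

```python
import random


def resolve_write(model, attempts, old=0, rng=random):
    """Return the new content of one cell after a concurrent write step.

    attempts: list of (processor_id, value) pairs with distinct processor ids.
    old: previous content of the cell (only used by fetch&add).
    """
    if not attempts:
        return old
    values = [v for _, v in attempts]
    if model == "common":
        # All writers must agree on the value.
        assert len(set(values)) == 1, "common-CRCW requires identical values"
        return values[0]
    if model == "arbitrary":
        # Some writer succeeds; the algorithm may not rely on which one.
        return rng.choice(values)
    if model == "priority":
        # The smallest numbered processor succeeds.
        return min(attempts)[1]
    if model == "min":
        # The processor writing the minimum value succeeds.
        return min(values)
    if model == "fetch&add":
        # All values are added to the old content.
        return old + sum(values)
    raise ValueError(f"unknown model: {model}")
```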

A parallel algorithm achieves optimal speed-up if its time × processor product matches the
number of operations of the fastest serial algorithm for the problem. Typically, we will state
our parallel results in the following form: "x operations and t time". Throughout this paper,
this will always translate into "t time using x/t processors". The papers [EG88], [KRS88],
[KR88] and [Vis83] overview research directions on parallel algorithms. All of them concede
that achieving optimal speed-up, or at least approaching this goal, is a crucial property for
parallel algorithms that are intended to be practical. A secondary (but very important) goal
is to minimize parallel time. Another critical practical concern is space requirements. These
guidelines led us in designing the algorithms of the present paper.
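As a worked instance of the "x operations and t time" convention, take the bounds stated in this section for the deterministic parallel algorithm of Section 2. The symbol p for the processor count is introduced here only for illustration.

```latex
p = \frac{x}{t}, \qquad
x = O(n \log\log m), \quad t = O(\log n)
\;\Longrightarrow\;
p = \frac{n \log\log m}{\log n}.
```

Since the serial algorithm also performs O(n log log m) operations, the time × processor product matches it, which is what optimal speed-up requires.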

Our (main) randomized sorting algorithms are presented as follows. We first present in
Section 2 a deterministic algorithm that takes O(n log log m) time and uses $O(m^{1/c})$
space, for any fixed c > 1. A parallel version of this deterministic algorithm takes
O(log n) time and $O(m^{1/c})$ space, for any fixed c > 1, using n log log m / log n
processors (optimal speedup). This parallel algorithm is designed for the non-standard
min-CRCW processors.

The second element in our presentation is described in Section 3. A randomized parallel
hashing scheme is presented. It is optimal and takes logarithmic time. Specifically, let W
be a given set of n numbers from an arbitrarily large domain [1..m]. We show how to find
a one-to-one mapping $F : W \to R$, where $|R| = O(n)$, by a randomized algorithm on the
arbitrary-CRCW PRAM. This mapping is computed in O(log n) expected time, using
n / log n processors. Evaluation of F(x), for each $x \in W$, takes O(1) worst case time.
The connection to the sorting results is as follows. We show how to use these hash
functions in order to reduce the space requirements in both the serial and parallel
algorithms to only O(n) space. The penalty is that now we have randomized algorithms
rather than deterministic ones. For the parallel algorithm, the parallel time increases to
O(log n log log m), while the operation count does not increase (asymptotically) on the
min-CRCW model.
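To make the goal concrete (a one-to-one map from W into a range of size O(n) with O(1) worst-case evaluation), here is a serial sketch of a standard two-level randomized perfect hashing construction. It is a swapped-in textbook technique used for illustration, not the parallel scheme of Section 3. The names `build_perfect_hash` and `P`, and the constant 4, are assumptions of this sketch. It assumes W is a nonempty set of distinct non-negative integers smaller than `P`.

```python
import random

P = (1 << 61) - 1  # a prime assumed to exceed every key


def _random_hash(size, rng):
    """Pick x -> ((a*x + b) mod P) mod size at random."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % size


def build_perfect_hash(W, rng=None):
    """Return (F, range_size) with F one-to-one on W and range_size <= 4*len(W)."""
    rng = rng or random.Random()
    W = list(W)
    n = len(W)
    # Level 1: retry until the sum of squared bucket sizes is linear in n.
    while True:
        h = _random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for x in W:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Level 2: a collision-free function into len(b)**2 cells per bucket.
    offsets, inner, total = [], [], 0
    for b in buckets:
        size = len(b) ** 2
        g = None
        while size:
            g = _random_hash(size, rng)
            if len({g(x) for x in b}) == len(b):
                break
        offsets.append(total)
        inner.append(g)
        total += size

    def F(x):
        k = h(x)                      # two hash evaluations: O(1) worst case
        return offsets[k] + inner[k](x)

    return F, total
```

Evaluating F on a key outside W is not meaningful in this sketch; the sorting application only needs F on the given set.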

The third element in our presentation is described in Section 4. Simulations of the
min-CRCW PRAM model by weaker models are presented. Some of the simulations are
randomized. Some simulations apply also to the fetch&add-CRCW model.

Denote a CRCW PRAM with a shared memory of size S as CRCW(S) PRAM. In the
following, we list some upper bounds for simulating one step of an n-processor min-CRCW(S)
PRAM:


• O(log n) expected time on an (n / log n)-processor arbitrary-CRCW(S + O(n)) PRAM
(optimal speedup).

• O(log n) time on an […]-processor priority-CRCW($O(S + n^{1+\epsilon})$) PRAM
($\epsilon > 0$).

• O(log log m) time on an n-processor arbitrary-CRCW(O(mS)) PRAM, where m is an
upper bound for the value that can be written to a memory cell by the min-CRCW
PRAM.


The first result is an improvement over a previously known result [EG88] where there
was a restriction that the memory addresses being used by the simulated min-CRCW are of
at most O(log n)-bit size. The result can be extended to simulating one step of a
fetch&add-CRCW PRAM. We are not aware of similar (i.e., optimal simulation) results even
for simulation of the (relatively weaker) priority-CRCW PRAM by an arbitrary-CRCW PRAM.
The last two simulations are deterministic.

Combining the optimal simulation from Section 4, the parallel hashing scheme in Section
3 and the algorithm in Section 2 we derive a randomized parallel algorithm for sorting
n integers from the range [1..m] on an arbitrary-CRCW that achieves O(log n log log m)
expected time, O(n log log m) expected number of operations and O(n) space.

The space efficient parallel sorting result is not the only one that can benefit from the
simulations. Recall the original min-CRCW PRAM sorting algorithm of Section 2. Together
with some of these simulations, we get the following deterministic sorting results on
standard PRAM models: (1) O(log n log log m) time and $O(m^{\epsilon})$ space (with any
fixed $\epsilon > 0$) using […] priority-CRCW processors; and (2) O(log n) time and
$O(m^{\epsilon})$ space ($\epsilon > 0$) using […] arbitrary-CRCW processors.

Some of the ideas we use in the deterministic algorithms of Section 2 go back to [vEBKZ77].
These ideas were inspired also by the algorithms of [Hag87] and [Joh82]. Johnson's
algorithm has the same complexity as our deterministic serial algorithm. However, our sorting
algorithm has two advantages: it is simpler and parallelizable.

The proposed parallel hashing scheme may be a useful tool for parallel algorithms that
use large space. We demonstrate this with several algorithms (in addition to our sorting
algorithm) for which the space requirement is large. By using the proposed parallel hashing
scheme, they become efficient and possibly practical.

There are applications that relate to combinatorial algorithms on strings. If the alphabet
is large then a naming assignment procedure for substrings is essential to avoid large space.
Such a deterministic procedure, due to [GG88], takes O(n log n) operations and evaluation of
a name takes O(log n) time. (We actually referred earlier to the same procedure.) Using our
parallel hashing scheme, the naming assignment takes O(n) expected number of operations
and evaluation of a name takes O(1) time in the worst case.

Another application is in the construction of suffix trees. Known parallel algorithms for
this problem require $O(n^{1+\epsilon})$ space [AIL+88] or $O(n \log n + m^2)$ space [GG88]
where n and m are the lengths of the text string and the pattern string respectively. Space
requirements in both algorithms can be reduced to O(n) and O(n log n), respectively, in
exchange for randomization and an O(log n) increase in time (but no change in number of
operations).

The parallel hashing scheme may be used to reduce the space in Hagerup's sorting
algorithm [Hag87] from $O(n^{1+\epsilon})$ to O(n) in exchange for randomization and an
O(log log n) increase in time (but no change in number of operations). The parallel hashing
scheme is also used in the optimal simulation in Section 4.

Recall  that  we  use  the  non  standard  min-CRCW  model  in  Section  2.  Note  that  we  do 
not  advocate  this  model  as  an  alternative  for  existing  “acceptable”  theoretical  models  for 
parallel  computation.  The  methodological  attitude  here  is:  (1)  design  the  algorithm  on  the 
min-CRCW  model,  and  (2)  show  later  how  to  simulate  this  model  on  more  acceptable  ones. 

Postscript. After all results in the present paper had been achieved and a first draft
had been distributed, we found out that very recently Bhatt et al. [BDH+89] designed
independently a parallel deterministic integer sorting algorithm which is related (though
not identical) to the basic construction of Section 2.1 below. Using a new list ranking
algorithm they were even able to reduce the time to O(log n / log log n + log log m) while
maintaining optimal speed up. However, none of the randomized and only part of the
deterministic space reduction ideas of the present paper appear there. Finally, we note
that we do not see how to use their results for further improving our randomized results,
since the list ranking part is not the main bottleneck for improving our space-efficient
randomized sorting algorithm.


2  The  Deterministic  Algorithm 


We  consider  the  following  problem: 

Input: Sequence x[1], ..., x[n] of distinct integers drawn from the domain [1..m], for some
integer m.

Problem: Sort the sequence. (Formally, compute the permutation $\pi$ of {1, ..., n}
such that $x[\pi(1)], ..., x[\pi(n)]$ is sorted in non-decreasing order.)

In Section 2.4 we discuss ways for withdrawing the distinctness assumption.

For presentation purposes, we falsely assume that the sequence x[1], ..., x[n] is given in
the following redundant form. There is a domain array of bits D[1..m] so that D[i] = 1 if
the value of some element x[j] is i and D[i] = 0 otherwise. Given bit D[i] = 1 we define the
smallest $i_1 > i$ such that $D[i_1] = 1$ to be the right neighbor of i.

This neighborhood relation translates easily to the values of the input sequence: the
domain right neighbor drn[i] of x[i] is the element x[j] = min{x[k] : x[k] > x[i]}. The array
drn defines a linked list, where the elements preceding x[i] in the linked list are smaller than
x[i] and the elements succeeding x[i] are larger. The distance from the beginning of the list
is the rank of x[i] relative to the input elements.

We  solve  the  sorting  problem  in  two  steps: 


(a) Compute the domain right neighbor of each index i.

(b) For each element x[i], compute its rank r in the linked list defined by drn, and let
$\pi(r)$ be i.


Step (b) can be (trivially) done in O(n) time or in parallel time O(log n) and optimal speed
up using the List Ranking algorithm ([AM88], [CV86], [CV88], [CV89]). Therefore, our
main concern is solving the domain right neighbor problem. For simplicity we assume that
$m = 2^{2^t}$ for some integer t > 1. The domain right neighbor, as defined above, is the
nearest neighbor from the right. We will also need a definition of the domain left neighbor,
dln[i], of element x[i]: the element x[j] = max{x[k] : x[k] < x[i]}.
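The two steps can be sketched serially as follows. This is a naive illustration that scans the whole domain array in O(m) time and space, not the algorithm of Section 2.1. It uses 0-based indices, and the name `sort_via_domain_array` is hypothetical.

```python
def sort_via_domain_array(x, m):
    """Return (drn, pi) for distinct integers x[i] in [1..m].

    drn[i] is the index of the domain right neighbor of x[i] (None for the
    largest element); pi lists the indices of x in sorted order.
    """
    n = len(x)
    # owner[v] plays the role of D[v]: it holds i if x[i] == v, else None.
    owner = [None] * (m + 1)
    for i, v in enumerate(x):
        owner[v] = i
    # Step (a): scan the domain right to left, remembering the last element seen.
    drn = [None] * n
    nearest = None
    for v in range(m, 0, -1):
        i = owner[v]
        if i is not None:
            drn[i] = nearest
            nearest = i
    # Step (b): follow the linked list from its head; the position is the rank.
    pi = []
    i = nearest                      # index of the smallest element
    while i is not None:
        pi.append(i)
        i = drn[i]
    return drn, pi
```

For example, with x = [5, 9, 2, 14, 7] and m = 16 the list starts at the element 2 and visits 5, 7, 9, 14 in turn.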

2.1  Algorithm  for  finding  Domain  Nearest  Neighbors 

The algorithm is recursive. The main effort is in defining precisely the problem that is being
solved recursively. The recursive algorithm will provide solutions for the problems of finding
left and right domain nearest neighbors. For each element the recursive algorithm separately
treats the domain right neighbor and the domain left neighbor computations. This is done
by duplicating each element x[i] into a left copy $x_l[i]$ and a right copy $x_r[i]$.
Intuitively, copy $x_r[i]$ is "responsible" for finding the domain right neighbor and
$x_l[i]$ is "responsible" for finding the domain left neighbor. Initially, $x_l[i] = x[i]$
and $x_r[i] = x[i]$.

In addition, two auxiliary copies are added: $x_r[n+1]$ and $x_l[n+1]$. Copy
$x_r[n+1] = 1$ is the domain left neighbor of the smallest input element. Similarly, copy
$x_l[n+1] = m$ is the domain right neighbor of the largest input element. (Note that at
this stage only for i = n + 1, $x_r[i]$ is not equal to $x_l[i]$.) We assume, without loss
of generality, that the input elements are from [2..m − 1].

Informally, our algorithm works as follows: The input elements are from an interval
I. Interval I is partitioned into small intervals, defining local problems of domain nearest
neighbor searches. For each subinterval $I_k$, at most two elements (smallest left copy and
largest right copy) might not have their neighbors in $I_k$. Such elements are collected into
the global sets GR and GL. Thus, a problem on interval I is reduced recursively into several
local problems on subintervals $I_k$ and one global problem. By choosing $I_k$ to be of size
$\sqrt{|I|}$ (for all k), we have that all local problems and the global problem are with
intervals of size $\sqrt{|I|}$.

The input for the recursive algorithm includes two sets L and R, whose values belong to
an interval I of integers. I is of the form a + [1..m'], i.e., I = {a + 1, a + 2, ..., a + m'}
for some a and m'. Set L will always represent a non-empty subset of the left copies and set
R will always represent a non-empty subset of the right copies. Each element in set L
searches for its left neighbor within set R. This left neighbor is defined as the largest
element in R which is smaller than it. Similarly, each element in set R searches for its
right neighbor, defined as the smallest element in L which is larger than it.

Initially, the interval I is [1..m] (i.e., a = 0 and m' = m), L is
$\{x_l[1], ..., x_l[n+1]\}$ and R is $\{x_r[1], ..., x_r[n+1]\}$.

The recursive algorithm DNN

Input: L, R and I, where L and R are nonempty sets and I = a + [1..m] is an interval of
integers. We refer to m = |I| as the size of the problem.

A processor stands by each element in L and each element in R.

if m = 2 (comment: the situation for a recursive problem for which m = 2 is characterized
in Corollary 2)

then Declare the element of L to be the right neighbor of the element of R and the element
of R to be the left neighbor of the element of L.

else

(1) Partition set L into $q = \sqrt{m}$ subsets $L_0, L_1, ..., L_{q-1}$ where $L_k$
contains elements (of L) from the interval $I_k = a + k \cdot q + [1..q]$, for
k = 0, ..., q − 1. Similarly, partition set R into $q = \sqrt{m}$ subsets
$R_0, R_1, ..., R_{q-1}$, where $R_k$ contains elements (of R) from the interval
$I_k = a + k \cdot q + [1..q]$, for k = 0, ..., q − 1.

(2) Let $a_k$ be the smallest element in $L_k$. If $a_k$ is less than or equal to the
smallest element in $R_k$ then: (a) Select the integer k into the new set GL (integer k
represents element $a_k$ and will be referred to as $a_k$). (b) Remove $a_k$ from $L_k$.

Similarly, let $b_k$ be the largest element in $R_k$. If $b_k$ is greater than or equal to
the largest element in $L_k$ then: (a) Select the integer k into the new set GR (integer k
represents element $b_k$ and will be referred to as $b_k$). (b) Remove $b_k$ from $R_k$.

(3) Do the following recursive calls: (a) For each pair of nonempty (local) subsets $L_k$
and $R_k$ solve the problem for an input consisting of $L_k$, $R_k$ and $I_k$. (b) If
(global) sets GL and GR are nonempty then solve the problem for an input consisting of GL,
GR and J = a + [0..q − 1].


Comments: 

• If the smallest (resp. largest) element of $L \cup R$ in $I_k$ is in L (resp. in R) then
its left (resp. right) neighbor is not in $I_k$. We collect the elements from L (resp. from
R) whose left (resp. right) neighbor is not in $I_k$ into the global set GL (resp. set GR).

• The algorithm advances to the deepest level of recursion and simply terminates (without
any backtracking).

•  All  recursive  calls  of  the  algorithm  are  performed  simultaneously  in  parallel. 
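A serial sketch of algorithm DNN, under several assumptions that are not part of the text: elements are carried as (value, original index) pairs, dictionaries stand in for the per-subinterval arrays, a comparison against an empty subset is treated as satisfied, and the global problem is run on the interval −1 + [1..q] so that the block numbers 0, ..., q − 1 fit the a + [1..m'] form. The names `dnn` and `sort_by_dnn` are hypothetical, and indices are 0-based with index n used for the two auxiliary copies.

```python
import math
from collections import defaultdict


def dnn(L, R, a, size, right_nbr):
    """Record right_nbr[j] = i when left copy i is the right neighbor of right copy j.

    L and R hold (value, index) pairs with values in a+1 .. a+size, where
    size is 2, 4, 16, 256, ... (that is, 2**(2**t)).
    """
    if size == 2:
        # Corollary 2: exactly one element on each side, and they are neighbors.
        (_, left_idx), = L
        (_, right_idx), = R
        right_nbr[right_idx] = left_idx
        return
    q = math.isqrt(size)
    Lk, Rk = defaultdict(list), defaultdict(list)
    for v, i in L:                                   # step (1)
        Lk[(v - a - 1) // q].append((v, i))
    for v, i in R:
        Rk[(v - a - 1) // q].append((v, i))
    GL, GR = [], []
    for k in set(Lk) | set(Rk):
        lk, rk = Lk[k], Rk[k]
        lo = min(lk) if lk else None                 # a_k
        hi = max(rk) if rk else None                 # b_k
        # Step (2): both tests use the subsets as they were before any removal.
        to_gl = lo is not None and (not rk or lo[0] <= min(rk)[0])
        to_gr = hi is not None and (not lk or hi[0] >= max(lk)[0])
        if to_gl:
            GL.append((k, lo[1]))
            lk.remove(lo)
        if to_gr:
            GR.append((k, hi[1]))
            rk.remove(hi)
        if lk and rk:                                # step (3a): local problem
            dnn(lk, rk, a + k * q, q, right_nbr)
    if GL and GR:                                    # step (3b): global problem
        dnn(GL, GR, -1, q, right_nbr)


def sort_by_dnn(x, m):
    """Indices of x in sorted order; x holds distinct integers in [2..m-1]."""
    n = len(x)
    L = [(v, i) for i, v in enumerate(x)] + [(m, n)]   # auxiliary left copy
    R = [(v, i) for i, v in enumerate(x)] + [(1, n)]   # auxiliary right copy
    right_nbr = {}
    dnn(L, R, 0, m, right_nbr)
    order, i = [], right_nbr[n]
    while i != n:                                      # follow the linked list
        order.append(i)
        i = right_nbr[i]
    return order
```

The recursion in this sketch is executed serially; in the parallel setting of the text all recursive calls at one level run simultaneously.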

2.2  Correctness 

Proposition 1 Let x[j] be the left neighbor of x[i]. Then for each level of the recursion the
following properties hold:

(a) The values of all elements in L are distinct and the values of all elements in R are
distinct.

(b) $x_l[i]$ and $x_r[j]$ are both represented in the same recursive sub-problem.

(c) $x_r[j] < x_l[i]$.

(d) $x_r[j]$ is the left neighbor of $x_l[i]$.

(e) Each element is represented in exactly one recursive sub-problem.

(f) Set L is nonempty if and only if set R is nonempty.

The  proof  of  Proposition  1  is  given  in  Appendix  A. 

Corollary 2 If a problem with input L, R and I is of size 2 then |L| = |R| = 1 and the
element in R is the left neighbor of the element in L.

Proof: Assume that $x_l[i] \in L$ and that x[j] is the left neighbor of x[i]. Let
I = [a, a + 1] for some integer a. Following from property (b) in Proposition 1,
$x_r[j] \in R$. Thus, from property (c) we have $x_r[j] = a$ and $x_l[i] = a + 1$. If
|L| > 1 then by property (a) a second element can only be $x_l[i'] = a$ (for some
$i' \ne i$). However, following property (c) its left neighbor must be < a, contradicting
property (b). Similarly, R may contain only one element. □

Corollary 3 Suppose that initially m, the size of the interval from which the input elements
are drawn, is $2^{2^t}$ for some integer t > 1 (and t = log log m). Consider a recursive
problem at recursion level t. The interval I from which the input elements for such a
recursive problem are drawn consists of two successive integers. That is, for some integer
x, I = x + [1..2]. Furthermore, L = {x + 2} and R = {x + 1} and x + 2 is the right neighbor
of x + 1.

2.3  Complexity  and  implementation 

We  start  by  discussing  the  complexity  of  algorithm  DNN  and  later  state  the  sorting  results. 


Lemma 4 Given are n elements with distinct values from the interval of integers [1..m]. (1)
Algorithm DNN works serially in O(n log log m) time and O(m) space. (2) Algorithm DNN
works in O(log log m) time and O(m) space, using n processors on a min-CRCW PRAM.

Proof: The size of interval I for each recursive sub-problem is bounded by $\sqrt{m}$.
Therefore, D(m), the depth of the recursion, satisfies $D(m) \le D(\sqrt{m}) + O(1)$,
implying D(m) = O(log log m). For each level of the recursion we need to perform at most
O(n) operations and therefore the total number of operations is O(n log log m).

To finish deriving the serial result we "remember" for each element in each level of the
recursion its original index in the initial sets L and R. The space needed for all subproblems
together in each level of the recursion is O(m). Since we can reuse the space for the highest
level of the recursion, we get a total of O(m) space, as well. Item (1) of the lemma follows.

We proceed to the parallel result. Initially, we assign one processor to each copy of
each element (i.e., each element of the original sets L and R). This assignment remains
throughout the entire algorithm.

For step (2), we need to have in $a_k$ (resp. $b_k$), for k = 0, ..., q − 1, the minimum
(resp. maximum) element's value over all the elements that belong to $L_k$ (resp. $R_k$).
We already argued (implicitly, in item (1) of the lemma) that sequentially this can be
trivially done in O(n) time. In parallel, step (2) can be done in O(1) time, using n
processors. Here is where we take advantage of the min-CRCW PRAM, where if several
processors try to write into the same memory location, only the one with the minimal value
succeeds. Item (2) of Lemma 4 follows. □
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2.4  Extensions 


Withdrawing  the  distinctness  assumption. 

We assumed that all elements in the input sequence are distinct. The purpose of this
subsection is to extend algorithm DNN and show that this assumption is not necessary. It
is trivial to achieve distinctness by replacing each input element x[i] by the pair <x[i], i>.
The problem is that this enlarges the required space to O(nm). Fortunately, this will not
cause any problem in deriving the randomized sorting results in Section 4.1.

However,  the  problem  remains  if  we  are  interested  in  deterministic  results.  The  remainder 
of  this  subsection  is  devoted  to  giving  alternative  extensions  that  are  less  space  consuming. 

We allocate processor i to element i, for each 1 ≤ i ≤ n. For each element i, let s(i) be
the smallest index of an element equal to element i (formally, x[i] = x[s(i)] and x[i] ≠ x[j]
for every j < s(i)). We make use of a bulletin board BB[1..m] (this term was previously used by
Galil [Gal84]).

Step 1. Processor i writes i into location BB[x[i]]. Since the min-CRCW PRAM is used,
location BB[x[i]] will contain s(i). Step 1 uses O(n) space.

Only processors (and their respective elements) that succeed in writing their index
participate in Step 2. Formally, these are all elements i for which s(i) = i. All
elements participating in Step 2 are distinct.

Step  2.  Apply  the  DNN  algorithm.  As  a  result  we  get  the  domain  right  neighbor  relative 
to  all  elements  participating  in  Step  2. 

Step 3. Associate the pair <s(i), i> with element i, for each i, and impose a lexicographic
order on these pairs. Apply the DNN algorithm with respect to these pairs. (It is easy
to map these pairs into the domain of integers [1..n^2], preserving the lexicographic
order. So Step 3 needs O(n^2) space.)

Step 4 derives the desired right neighbor relation from the outcome of Steps 2 and 3.
Let x[k] be the right neighbor of x[s(i)], as a result of Step 2, and let <s(j), j> be
the right neighbor of the pair <s(i), i>, as a result of Step 3.

Step 4.

if s(i) = s(j)

then x[j] is the right neighbor of x[i]
else x[k] is the right neighbor of x[i].
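Steps 1 to 4 can be traced serially. The sketch below is illustrative only: the name `right_neighbors_with_duplicates` is hypothetical, a dictionary stands in for the bulletin board BB, and ordinary sorting stands in for the two calls to algorithm DNN (whose definition lies outside this subsection). Indices are 0-based, and `None` marks an element with no right neighbor.

```python
def right_neighbors_with_duplicates(x):
    """Serial sketch of Steps 1-4; sorting replaces the DNN calls (assumption)."""
    n = len(x)
    # Step 1: bulletin board. Indices arrive in increasing order, so keeping
    # the first writer mimics the min-CRCW rule and yields s(i).
    bb = {}
    for i, v in enumerate(x):
        if v not in bb:
            bb[v] = i
    s = [bb[v] for v in x]
    # Step 2: right neighbors among the distinct representatives (s(i) = i).
    reps = sorted(set(s), key=lambda i: x[i])
    next_rep = {reps[t]: reps[t + 1] for t in range(len(reps) - 1)}
    # Step 3: right neighbors among the pairs <s(i), i> in lexicographic order.
    order = sorted(range(n), key=lambda i: (s[i], i))
    next_pair = {order[t]: order[t + 1] for t in range(n - 1)}
    # Step 4: combine the two relations.
    result = [None] * n
    for i in range(n):
        j = next_pair.get(i)
        if j is not None and s[j] == s[i]:
            result[i] = j                  # another copy of the same value
        else:
            result[i] = next_rep.get(s[i])  # first copy of the next value
    return result
```

For x = [5, 3, 5, 1, 3] the resulting list starts at index 3 (value 1) and continues through indices 1, 4, 0, 2, so equal values appear in order of their original indices.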

We conclude,

Lemma 5 Given are n elements with values from the interval of integers [1..m]. (1) The
extension of algorithm DNN works serially in O(n log log m) time and O(m + n^2) space. (2)
The extension of algorithm DNN works in O(log log m) time and O(m + n^2) space, using n
processors on a min-CRCW PRAM.


Reducing  (deterministically)  the  space  requirements. 

For simplicity, let us assume that m > n^2. The reason is that our results extend known
results mostly when this assumption holds.


Lemma 6 Given are n elements with values from the interval of integers [1..m]. Algorithm
DNN can be further enhanced to: (1) run serially in O(n log log m) time and O(m^{1/c}) space,
where c > 1 is any fixed constant; (2) run in O(log log m) time and O(m^{1/c}) space (for any
constant c > 1), using n processors on a min-CRCW PRAM.


Proof: The output of the extended DNN algorithm defines a linked list of the elements
defined by the right neighbor relation. This linked list is stable in the following sense: if
x[i] = x[j] for some i < j then element i precedes element j in the linked list. A consequence
is that an iterative method, in the spirit of Radix Sort, can be applied. Thus, given an
algorithm of time T and space S, for each integer c > 0, we can have a DNN algorithm with
O(cT) time and O(S^{1/c}) space. ■
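The Radix Sort idea can be illustrated as follows. This is a sketch under stated assumptions: the name `sort_by_stable_passes` is hypothetical, and a stable bucket pass over one digit stands in for one run of the stable extended DNN algorithm on a domain of size about m^{1/c}.

```python
def sort_by_stable_passes(keys, m, c):
    """Return the indices of keys (values in 1..m) in stable sorted order,
    using c stable passes over digits of a base of roughly m**(1/c)."""
    base = 1
    while base ** c < m:      # smallest base with base**c >= m
        base += 1
    order = list(range(len(keys)))
    for d in range(c):        # least significant digit first
        buckets = [[] for _ in range(base)]   # only O(base) cells per pass
        for i in order:
            digit = ((keys[i] - 1) // base ** d) % base
            buckets[digit].append(i)          # appending keeps the pass stable
        order = [i for bucket in buckets for i in bucket]
    return order
```

Each pass touches a table of about m^{1/c} cells instead of m, at the price of c passes; stability of every pass is what makes the combination correct.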

Our  primary  concern  in  this  section  is  the  sorting  problem.  So,  if  we  add  the  missing  list 
ranking  step  to  the  DNN  algorithm  we  get, 


Theorem 1 Given n elements with values from the interval of integers [1..m], our sorting
algorithms achieve the following results: (1) O(n log log m) serial time and O(m^{1/c}) space,
where c > 1 is any fixed constant. (2) O(log n) parallel time and O(m^{1/c}) space (for any
constant c > 1), using n processors on a min-CRCW PRAM.


If the input is from a range polynomial in n then we have the same complexities as in
Hagerup’s algorithm [Hag87]. The range for which our algorithm gives better results than the
best known algorithms is n^{log log n} ≪ m ≪ 2^{n^{O(1)}}, where ≪ denotes smaller asymptotically.
Thus, for example, for m = n^{polylog n} we have:


Corollary 7 n integers from the range [1..n^{log^k n}] (for every constant k > 0) can be sorted in:
(1) O(n log log n) serial time and O(n^{(log^k n)/c}) space, for any constant c > 1; and (2) O(log n)
parallel time and O(n^{(log^k n)/c}) space, for any constant c > 1, using n processors on a
min-CRCW PRAM.

3 Trading Space for Randomness


In this section we show how to use randomization in order to reduce the space complexity
of the algorithms to O(n). The resulting algorithm is of the “Las Vegas” type. That
is, some of the steps of the space reducing algorithm are based on randomized moves, but it
never errs.

Recall that in each recursive level of algorithm DNN, there are O(m) variables a_k and b_k.
However, in each level of the algorithm only O(n) of these variables are actually used. For
a given level denote by W the set of a_k and b_k variables that are being used (|W| = O(n)).
The deterministic implementation in Sec. 2 requires O(m) space for the a_k and b_k variables.
The key idea here is to randomly select a hash function for mapping the set W into O(n)
space. In order to avoid collisions, we shall use a perfect hash function (which is a one-to-one
function).

Remark.  For  the  serial  algorithm,  much  simpler  hash  functions  would  suffice  in  order  to 
reduce  the  space  to  O(n).  From  now  on  we  concentrate  on  the  parallel  algorithm.  Based  on 
this,  the  implementation  for  the  serial  algorithm  will  be  straightforward. 

A  basic  procedure  for  constructing  a  perfect  hash  function  is  presented.  Its  expected 
running  time  is  logarithmic  and  its  expected  number  of  operations  is  linear.  Specifically,  we 
prove  in  Subsection  3.1  the  following  theorem: 


Theorem 2 Let W be a set of n numbers from the range [1..m], where m + 1 = p is prime.
Suppose we have n/log n processors on an arbitrary-CRCW PRAM. A one-to-one function F :
W → [1..5n] can be found in O(log n) expected time. The evaluation of F(x), for each
x ∈ W, takes O(1) arithmetic operations with numbers from [1..m].


We  show  later  that  Theorem  2  leads  to  the  following: 


Corollary 8 A. Algorithm DNN (and its extension) takes O(log n log log m) expected time
and O(n) space, using n/log n processors on a min-CRCW PRAM. B. The same performance
is obtained for the problem of sorting n integers drawn from a domain of size m.


3.1  Constructing  a  parallel  perfect  hash  function 

In  this  subsection  we  prove  Theorem  2. 

Given is a set W of n numbers from the range [1..m], where p = m + 1 is prime. The hash
function F maps W into the range [1..5n]. We use the fundamental perfect hash function
F that was suggested by Fredman, Komlós and Szemerédi [FKS84], as described below. An
efficient parallel construction of F is then presented.

Define f_k : W → [1..n] as f_k(x) = 1 + ((kx mod p) mod n), where k is a parameter
from [1..m]. Let B(k,j) be the set of values in W that are mapped by f_k into j, i.e.
B(k,j) = {x ∈ W : f_k(x) = j}. Also, let b(k,j) = |B(k,j)| and S_k = Σ_{j=1}^{n} b(k,j)^2. For
each j = 1, ..., n, define f'_{k',r} : B(k,j) → [1..r^2] as f'_{k',r}(x) = 1 + ((k'x mod p) mod r^2), where
k' = k'(j) is a parameter from [1..m] and r = b(k,j). Construction of the hash function F
will be based on two steps:


(a) Select k for which S_k < 5n.

(b) For each j, select k' (= k'(j)) for which f'_{k',b(k,j)} is one-to-one.


After all parameters k and k'(j) (for all j = 1, ..., n) are appropriately selected, the
one-to-one function F is derived by first applying f_k and then f'_{k'(j),b(k,j)} for the proper j.
Specifically, F is constructed as follows. For each set of elements B(k,j), b(k,j)^2 space is
assigned. Let M_i be the prefix sum Σ_{j=1}^{i} b(k,j)^2 (with M_0 = 0). Then [1..M_i] is an array
assigned to the first i sets B(k,1), ..., B(k,i). The function f'_{k'(j),b(k,j)} maps each element
in B(k,j) into M_{j-1} + [1..b(k,j)^2].

F(x) is evaluated as follows:

(a’) Evaluate j = f_k(x).

(b’) F(x) = M_{j-1} + f'_{k'(j),b(k,j)}(x).


Step (a) guarantees that the overall space M_n = S_k is linear (< 5n). Step (b) guarantees
that the mapping is one-to-one. It remains to show how to implement steps (a) and (b).

Fact 1 ([FKS84]) For at least one-half of the values k in [1..m], S_k < 5n. Thus, for a
randomly selected k, S_k < 5n with probability ≥ 1/2.

Define k' (= k'(j)) to be good if f'_{k'(j),b(k,j)} is a one-to-one function (over B(k,j)).

Fact 2 ([FKS84]) For each j in [1..n], at least one-half of the k' values in the range [1..m] are
good.
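Facts 1 and 2 mean that plain retry loops find suitable parameters after a small expected number of attempts. The following serial sketch illustrates the two-level scheme; it is not the parallel construction of this section. The name `build_fks`, the `seed` parameter, and the 0-based bucket index j (the text uses 1-based j) are assumptions made for illustration.

```python
import random

def build_fks(W, p, seed=0):
    """Return a function F that is one-to-one on W, with values in 1..5n.
    Assumes p is prime and every element of W lies in 1..p-1."""
    rng = random.Random(seed)
    n, m = len(W), p - 1
    # Step (a): retry until the sum of squared bucket sizes is below 5n.
    while True:
        k = rng.randint(1, m)
        buckets = [[] for _ in range(n)]
        for x in W:
            buckets[(k * x % p) % n].append(x)
        if sum(len(b) ** 2 for b in buckets) < 5 * n:
            break
    # Prefix sums M_0, ..., M_n of the squared bucket sizes.
    offsets = [0]
    for b in buckets:
        offsets.append(offsets[-1] + len(b) ** 2)
    # Step (b): per bucket, retry until the second-level function is injective.
    inner = [0] * n
    for j, b in enumerate(buckets):
        r2 = len(b) ** 2
        while b:
            kp = rng.randint(1, m)
            if len({(kp * x % p) % r2 for x in b}) == len(b):
                inner[j] = kp
                break

    def F(x):
        j = (k * x % p) % n                      # step (a')
        r2 = offsets[j + 1] - offsets[j]
        return offsets[j] + 1 + (inner[j] * x % p) % r2   # step (b')

    return F
```

Evaluating F costs a constant number of arithmetic operations, and only the prefix sums and the per-bucket parameters need to be stored.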

To  construct  F,  we  apply  the  following  randomized  procedure: 


(a) Repeatedly select k at random until S_k < 5n.

(b) For each j, repeatedly select k' = k'(j) at random until it is good.

Implementation of Step (a).

Following Fact 1, the expected number of iterations in (a) is ≤ 2. We first show how
to check whether S_k < 5n. Given b(k,j) for all j, the evaluation of the prefix sums M_i =
Σ_{j=1}^{i} b(k,j)^2 (for i = 1, ..., n) and of S_k = M_n can be done by using the Prefix Sums algorithm
of Cole and Vishkin [CV86] [CV89] in O(log n/log log n) time and O(n) operations. The
evaluation of b(k,j) for each j is done as follows:

(1) Sort the n numbers in {f_k(x) : x ∈ W} into an array C[1..n].

(2) Find for each j the rightmost (resp. leftmost) index i_1 (resp. i_2) for which C[i_1] = j
(resp. C[i_2] = j). Let b(k,j) be i_1 - i_2 + 1.
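Steps (1) and (2) can be sketched serially as follows. The name `bucket_sizes` is hypothetical, and binary search from the standard `bisect` module stands in for the constant-time parallel boundary detection; indices inside the code are 0-based.

```python
from bisect import bisect_left, bisect_right

def bucket_sizes(hashes, n):
    """Given the values f_k(x) in 1..n, return [b(k,1), ..., b(k,n)]."""
    C = sorted(hashes)                    # step (1)
    sizes = []
    for j in range(1, n + 1):             # step (2)
        i2 = bisect_left(C, j)            # leftmost position holding j
        i1 = bisect_right(C, j) - 1       # rightmost position holding j
        sizes.append(i1 - i2 + 1)         # 0 when j does not occur
    return sizes
```

When a value j does not occur, the two positions cross and the formula i_1 - i_2 + 1 gives 0, so empty buckets need no special case.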


Step (2) can be trivially done in O(1) time using n processors. To do step (1), note first
that the range of f_k is the integer interval [1..n]. Thus, we may employ the integer sorting
algorithm due to Rajasekaran and Reif:


Lemma 9 ([RR89]) n keys from the range [1..n] can be sorted using n/log n arbitrary-CRCW
PRAM processors in O(log n) time, with probability ≥ 1 - 1/n^a, for any constant a > 0.

Following  the  above  we  have  that  each  iteration  in  step  (a)  of  the  construction  of  F  takes 
O(n)  expected  number  of  operations  and  logarithmic  expected  time.  We  conclude 


Lemma 10 Given is a set W of n numbers from the range [1..m] and some k ∈ [1..m].
Checking whether S_k < 5n can be done in O(log n) expected time, using n/log n arbitrary-CRCW
processors.


Corollary 11 Step (a) in the construction of F takes O(log n) expected time, using n/log n
arbitrary-CRCW processors.


Implementation  of  Step  (b). 

In step (b) the procedure to check whether k' is good for j is easy when using the
arbitrary-CRCW PRAM. Our goal is to select a good k' for each j within a total of O(n)
operations and logarithmic time. The difficulty is that Step (b) should be done independently
for each j (j = 1, ..., n). We prove


Lemma 12 A good k' = k'(j) can be found for all j, j = 1, ..., n, in O(log n) expected time,
using n/log n processors.

Proof: To prove the lemma we show three facts: (I) the expected maximum number of
trials in selecting a good k' for j (over j = 1, ..., n) is O(log n); (II) the expected total work
of selecting good k' values for all j (j = 1, ..., n) is O(n); and (III) the processors can be allocated
according to (a) and (b) to yield an O(log n) expected time parallel procedure for step (b)
in the construction of F, using n/log n processors.

Let t_j be the number of trials before a good k' is found for j, and let t = max{t_j : j =
1, ..., n}.

Claim 13 E[t] ≤ 2(⌈log n⌉ + 1).


Proof: Let N_i be the number of j's for which a good k' was not found in the first i trials;
i.e. N_i = |{j : t_j > i}|. Consider the following Bernoulli trials: A success in the i’th trial is
defined to be the case that the number of good k' values found in the i’th trial is at least N_i/2;
that is, a success is when N_{i+1} ≤ N_i/2. Following Fact 2, Prob[success] ≥ 1/2.

Let x be the number of trials until the (⌈log n⌉ + 1)’th successful trial. It is easy to see
that t is bounded by x and therefore E[t] ≤ E[x] ≤ 2(⌈log n⌉ + 1). ■
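The bound of Claim 13 can be compared with an exact calculation in an idealized worst case. The sketch below assumes that every trial for every j succeeds independently with probability exactly 1/2 and that logarithms are base 2; both are assumptions made for illustration, as are the function names. Under these assumptions Prob[t > i] = 1 - (1 - 2^{-i})^n, and E[t] is the sum of these tail probabilities.

```python
import math

def expected_max_trials(n, terms=200):
    """E[max of n independent geometric(1/2) variables], via the tail sum
    E[t] = sum over i >= 0 of Prob[t > i], truncated after `terms` terms."""
    return sum(1.0 - (1.0 - 0.5 ** i) ** n for i in range(terms))

def claim13_bound(n):
    """The right-hand side 2(ceil(log2 n) + 1) of Claim 13."""
    return 2 * (math.ceil(math.log2(n)) + 1)
```

In this idealized setting the expectation grows like log n plus a small constant, so the bound of the claim holds with roughly a factor of two to spare.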

Claim  13  proves  fact  (I).  To  prove  fact  (II)  we  first  show 


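Claim 14 E[t_j] ≤ 2 for each j.


Proof:

\[
E[t_j] = \sum_{i=1}^{\infty} i \cdot \mathrm{Prob}[t_j = i] \le \sum_{i=1}^{\infty} \frac{i}{2^i} = \sum_{i=1}^{\infty} \sum_{k=1}^{i} \frac{1}{2^i};
\]

changing the order of summation, we get

\[
= \sum_{i=1}^{\infty} \sum_{k=i}^{\infty} \frac{1}{2^k} = \sum_{i=1}^{\infty} \frac{1}{2^{i-1}} = 2.
\]

■


Let op_j be the number of operations required for selecting a good k' for j. Since op_j =
t_j b(k,j) we have E[op_j] = E[t_j] b(k,j) ≤ 2 b(k,j) and the total number of operations is
expected to be E[Σ_{j=1}^{n} op_j] ≤ 2 Σ_{j=1}^{n} b(k,j) = 2n.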

It remains to give an implementation for the processors allocation. The selection of the
numbers k'(j) is done in phases. In each phase a new number k' is selected and tested, for
each j such that no good k'(j) has been found. An element x ∈ B(k,j) is called active if no
good k'(j) has yet been found (for j). Let N'_i be the number of active elements in phase i.

We use a standard idea. Initially, N'_1 is n. As long as N'_i > n/log n, we simply compact
all N'_i active elements of phase i into an array of length N'_i prior to the phase. Consider the
first phase j for which N'_j ≤ n/log n. The compacted array of size N'_j will be used for all
subsequent phases.

Finally,  a  trivial  application  of  Brent’s  scheduling  theorem  will  provide  actual  assignment 
of  processors  to  jobs. 

Specifically, let VAL[1..5n] be an array. The perfect hash function F, being constructed,
will map each element of the input set W into array VAL.

Less  informal  implementation  of  Step  (b). 


i := 1; N'_1 := n.

While N'_i > n/log n do

(All active elements in W are in array ACTIVE[1..N'_i], sorted by the B(k,j) to which
they belong)

Phase i

1. The first element in an active B(k,j) selects at random a k'(j) value.

2. Each active element x evaluates its hashing value F(x) using f'_{k'(j),b(k,j)} and writes its
original value x into VAL[F(x)] (using the arbitrary-CRCW convention).

3. Each active element x checks whether its value is written in VAL[F(x)]. If not, it
“disqualifies” the k'(j) for its B(k,j) set.

4. All active elements belonging to B(k,j) sets whose k'(j) was disqualified remain
active in phase number i + 1. Their number is N'_{i+1}.

5. Using a prefix sums algorithm, compact them into array ACTIVE[1..N'_{i+1}].

6. i := i + 1.

end while

Denote j = i. N'_j is at most n/log n.

While there is j for which a good k'(j) has not been found do

(All active elements in W are in array ACTIVE[1..N'_j], sorted by the B(k,j) to which
they belong)

Do steps 1-4 and 6 as above.

end while

Complexity. Following Fact 2, E[N'_i] ≤ n/2^{i-1}. Therefore, the expected total work is
Σ_i E[N'_i] ≤ 2n. The time in each phase in part 1 is dominated by the compaction procedure
(step 5). Using the Prefix Sums algorithm each phase takes O(log n/log log n) time. Using
arguments similar to those in Claim 13, E[j] ≤ 2 log log n where j is the number of phases in
the first part. The expected number of phases for part 2 is O(log n) based on Claim 13.

Using Brent’s theorem [Bre74] step (b) can be implemented in O(log n) expected time,
using n/log n processors on an arbitrary-CRCW PRAM. ■
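The write-then-verify test of steps 2 and 3 can be traced in a serial simulation of the phases. This is a sketch under stated assumptions, not the parallel procedure: the name `select_inner_keys` and its arguments are hypothetical, buckets are indexed from 0, `offsets` holds the prefix sums M_0, ..., M_n, and a loop in which the last writer wins stands in for the arbitrary-CRCW write.

```python
import random

def select_inner_keys(buckets, offsets, p, seed=0):
    """Find a good k'(j) for every non-empty bucket by phases.
    Returns the chosen parameters and the number of phases used."""
    rng = random.Random(seed)
    val = [None] * (offsets[-1] + 1)          # the array VAL, cells 1..M_n

    def slot(j, x, kp):
        r2 = len(buckets[j]) ** 2
        return offsets[j] + 1 + (kp * x % p) % r2

    kprime = {}
    active = [j for j, b in enumerate(buckets) if b]
    phases = 0
    while active:
        phases += 1
        trial = {j: rng.randint(1, p - 1) for j in active}      # step 1
        for j in active:                                         # step 2
            for x in buckets[j]:
                val[slot(j, x, trial[j])] = x    # one of the writers wins
        still_active = []
        for j in active:                                         # step 3
            if all(val[slot(j, x, trial[j])] == x for x in buckets[j]):
                kprime[j] = trial[j]
            else:
                still_active.append(j)                           # step 4
        active = still_active                 # step 5: compaction
    return kprime, phases
```

Two elements of the same bucket that collide write into the same cell, so at least one of them fails to read back its own value and disqualifies the trial; buckets never interfere because their cell ranges are disjoint.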

Having  Lemma  10  and  Lemma  12,  Theorem  2  immediately  follows. 


3.2  Applications 

As  a  motivation  for  the  previous  subsection  we  stated  Corollary  8. 

Proof of Corollary 8. In algorithm DNN (and its extension) there are log log m phases.
Separately for each phase we hash the O(n) variables a_k and b_k. Part A of Corollary 8
follows. Part B is trivial. ■

We  mention  here  some  examples  of  algorithms  for  which  the  parallel  hashing  scheme  can 
be  used. 

One application relates to the construction of suffix trees. This is probably the most
important data structure for algorithms on strings. Applications of this data structure
are reviewed in [Apo34]. [GG88] indicate that the space requirement of the suffix tree
construction is the source of inefficiency in quite a few parallel preprocessing algorithms. The
parallel algorithm by [AIL+88] for constructing a suffix tree requires O(log n) time, O(n log n)
operations and O(n^{1+ε}) space (for any 0 < ε < 1), where n is the length of the input string.
Using the parallel hashing scheme, the space requirement decreases to O(n), while the time
increases to expected O(log^2 n) and the number of operations remains O(n log n) (as expected
value rather than worst case). Suppose we have a relatively short string, whose length is
m, and a long string, whose length is n. [GG88] considered instances where suffix trees
are needed only for supporting queries requesting comparison among substrings of the short
string and the long string. Their algorithm takes O(log m) time, O(n log m) operations and
O(n log n + m^2) space. Using the parallel hashing scheme, the space requirement decreases
to O(n log n), the time increases to expected O(log n log m) and the (expected) number of
operations remains O(n log m).

Section 2 in the approximate string matching survey of [GG88] discusses various ways for
hashing many different substrings of a certain string. This is a fundamental problem that
arises in some string matching automaton-like algorithms. They consider both serial and
parallel computation. Given a string x, the size of the alphabet is relevant to the assignment
of names to different substrings of x. In principle, different names should be assigned to
different substrings of a given string x. In cases where the number of different substrings is
n, a name should be a number from [1..n]. Name assignment is a mapping from a possibly
large domain into [1..n]. Galil and Giancarlo propose an assignment procedure that takes
O(n log n) operations where |x| = n [GG88]; subsequently, it takes O(log n) time to find the
name of a substring. Using our parallel hashing scheme, the assignment procedure takes
O(n) expected number of operations and O(log n) expected time; finding a name for a given
substring takes O(1) worst-case (!) time.

Lamdan  and  Wolfson  [LW88]  use  hashing  for  object  recognition.  Their  method  is  parallel 
in  a  straightforward  manner,  except  for  their  hash  table  construction.  The  parallel  hashing 
scheme  can  be  useful  there. 

Hagerup’s algorithm for sorting integers from polynomial range [Hag87] has the drawback
of using O(n^{1+ε}) space (for any fixed ε > 0). By using the parallel hashing scheme its space
complexity decreases to O(n). The expected number of operations remains the same and
the time increases from O(log n) to expected O(log n log log n).

Finally,  the  parallel  hashing  scheme  is  used  to  get  an  optimal  randomized  simulation  of 
the  min-CRCW  PRAM  by  arbitrary-CRCW  PRAM,  as  given  in  Section  4. 

Comment  on  finding  a  prime  in  a  given  range. 

We assumed above that m + 1 is a prime. To withdraw this assumption we should give
a procedure that, given some m, finds a prime p > m such that log log p = O(log log m); i.e.
p ∈ [(m + 1)..m^{log^k m}], for some constant k > 0. We have some preliminary results on this.
In particular, we know how to find p in O(n) operations. The parallel time complexity is
O(log^3 m) with high probability. To see the significance of such a procedure we should refer
to  the  way  in  which  the  sorting  algorithm  is  viewed.  If  the  algorithm  is  for  a  fixed  range 
[l..m]  then  finding  p  is  just  a  preprocessing  which  may  be  done  only  once  (p  can  then  be 
part  of  the  algorithm’s  description).  We  may,  however,  use  the  sorting  algorithm  as  an  input 
sensitive  algorithm  with  no  a  priori  knowledge  about  the  range.  Specifically,  after  reading 
the  input  values  the  actual  range  may  be  found  by  using  a  maximum  finding  procedure.  In 
this  case,  an  efficient  procedure  for  finding  a  prime  p  is  desired. 
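For intuition only, a prime above m can be found by a naive serial search. The sketch below is explicitly not the procedure alluded to above (whose details are not given here); it swaps in plain trial division, and the names `is_prime` and `next_prime_above` are hypothetical. By Bertrand's postulate a prime exists in (m, 2m], so the returned p satisfies log log p = O(log log m).

```python
def is_prime(q):
    """Trial division; adequate only for small illustrative values."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def next_prime_above(m):
    """Smallest prime p with p > m (naive search, not the paper's method)."""
    p = m + 1
    while not is_prime(p):
        p += 1
    return p
```

With such a p in hand, the hash functions are taken modulo p and the parameters k and k' are drawn from [1..p-1].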


4  Simulating  the  min-CRCW  PRAM 


In this section we deal with simulations of the min-CRCW PRAM by weaker (and more
acceptable) models of parallel computation. We show applications of some of these simulations
for the parallel sorting algorithms.

Our most interesting simulation result is the following:

Simulation result 1 One step of an n-processor min-CRCW PRAM can be simulated by
an (n/log n)-processor arbitrary-CRCW PRAM in O(log n) expected time (optimal speed up) and
O(n) additional space.

Before proceeding to prove this simulation result, we make some general comments on
how the result should be read and what has to be proved. These comments apply to other
simulation results below, as well. The difference between the min-CRCW PRAM, being
simulated, and the simulating arbitrary-CRCW PRAM lies in the way write conflicts are
resolved. For this reason our proof needs to be concerned only with a ‘write’ stage of the
min-CRCW PRAM on the arbitrary-CRCW PRAM. The space requirements for the simulating
arbitrary-CRCW PRAM should be read as follows: (1) it needs as much space as the min-CRCW
PRAM; in addition, (2) O(n) space is needed.

Lemma 15 Consider the problem of simulating a single ‘write’ stage of an n-processor
min-CRCW PRAM on an arbitrary-CRCW PRAM. This problem can be reduced in O(log log n)
time and O(n) operations (on an arbitrary-CRCW PRAM), to the problem of sorting n
integers from the range [1..n]. The reduction uses O(m) space, where m is the size of the
memory in the simulated min-CRCW PRAM.

Proof: Suppose the memory of the min-CRCW PRAM is an array M[1..m] of size m.
We denote the processors of the simulated min-CRCW PRAM by MP_i, 1 ≤ i ≤ n. As
usual we will refer to the computation on the simulating arbitrary-CRCW PRAM in terms
of operations, and suppress the issue of allocation of these operations to processors of the
simulating machine. We will make one exception to this, in a case where such allocation
requires special care. A typical ‘write’ stage of n min-CRCW PRAM processors can be
viewed as follows. Processor MP_i, 1 ≤ i ≤ n, attempts to write value v_i into target address
M[t_i]. Let S_i be the set of elements j such that t_j = t_i. The definition of the min-CRCW
PRAM implies that v_i is written into M[t_i] if v_i = min{v_j : j ∈ S_i}.

The simulation makes use of a bulletin board BB[1..m] that enables direct communication
between all elements with the same target address. It works as follows:

a. For each processor MP_i, write its index i into memory location BB[t_i]. Arbitrarily, some
index i' ∈ S_i succeeds and i' is written into BB[t_i].

The main idea is to “label” each processor in S_i by the same label i' and group all
processors in S_i together into a successive subarray (Step (b)). The simulation of the
min-CRCW PRAM is carried out by determining the minimum value v_i over each such
successive subarray (Step (c)).

b. LABEL[i] ← BB[t_i]. Sort array LABEL into an array G[1..n].

Identical LABEL values occupy successive subarrays of G. The beginning and end of
each such subarray can be easily determined.

c. For each such successive subarray in G, find the minimum v_i over {v_j : LABEL[j] is in
the subarray}, and write v_i into memory location M[t_i].
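Steps (a) to (c) can be followed in a serial sketch. It is illustrative only: the name `simulate_min_write` is hypothetical, a dictionary stands in for the bulletin board, a shuffled write order models the arbitrary winner, and addresses are 0-based.

```python
import random

def simulate_min_write(memory, targets, values, seed=0):
    """Processor i tries to write values[i] into memory[targets[i]];
    the minimum value per target address must win."""
    rng = random.Random(seed)
    n = len(targets)
    # Step (a): every processor writes its index; an arbitrary one survives.
    bb = {}
    writers = list(range(n))
    rng.shuffle(writers)
    for i in writers:
        bb[targets[i]] = i
    # Step (b): label every processor by the surviving index and sort.
    label = [bb[t] for t in targets]
    G = sorted(range(n), key=lambda i: label[i])
    # Step (c): one minimum per run of equal labels, then a single write.
    start = 0
    while start < n:
        end = start
        while end + 1 < n and label[G[end + 1]] == label[G[start]]:
            end += 1
        best = min(values[G[t]] for t in range(start, end + 1))
        memory[targets[G[start]]] = best
        start = end + 1
    return memory
```

The labels are processor indices, so the sort is over integers from a range of size n regardless of how large the memory is; this is what allows an integer sorting algorithm for the range [1..n] to be used in Step (b).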


Step (a) can be done in O(1) time and O(n) operations using the arbitrary-CRCW
PRAM. Step (c) can be done in time O(log log n) and O(n) operations.

We comment on how to actually perform step (c) using n/log log n processors within
O(log log n) time. Step (c) employs the O(log log l) time optimal speed up algorithm
for finding the minimum among l elements given in an array of length l, as in Shiloach
& Vishkin [SV81]. The processors allocation for Step (c) is done as follows. In order to
simplify the explanation, assume that we have 3n/log log n processors. We assign three
processors to each successive subarray (called interval) of length log log n in array LABEL.
These three processors will participate in the minimum computation of all subarrays that
intersect their interval. If there is a subarray that ends in the interval, we assign to it the
first processor. If there are subarrays that are contained in the interval, we assign to all
of them the second processor. If there is a subarray that begins in the interval, we assign
to it the third processor. If the interval is fully contained in a subarray, we assign to it the
first processor (and the other two will remain idle). Each processor of each interval clearly
finishes finding the (local) minima for the intersection of the interval with each intersecting
subarray in O(log log n) time. The first and third processors may then participate in a global
minimum computation of an appropriate subarray. Such global computation will apply the
Shiloach-Vishkin algorithm for each subarray separately.

The space complexity is dominated by the array BB whose size is m. The values written
in the memory locations BB[t_i] in Step (a) are from the range [1..n]. Therefore, the sorting
problem of step (b) is indeed of elements from this range only. The Lemma follows. ■

A major drawback of the above reduction is the potentially large space that it might
require. (Recall that this space is in addition to the memory which is as in the simulated
min-CRCW.) However, the space consuming array BB is only used in step (a). It is easy
to see that the parallel hashing scheme from Section 3 can be applied here. As a result the
additional space which is used by the reduction is reduced to O(n) while the time complexity
remains the same (up to a constant factor), only expected rather than deterministic, using
the same number of processors.

Thus, Simulation Result 1 follows from Lemma 15, Lemma 9 and Theorem 2.
Comments: 

1. Simulation result 1 improves a similar theorem in the survey of Eppstein and Galil
[EG88] (the min-CRCW is called there strong-CRCW), where it is assumed that addresses
can be written in at most O(log n) bits. Recall that Simulation result 1 does not impose any
size restriction on the memory to be used by the simulated machine.


2. Lemma 15 can be extended to hold in O(log n/log log n) time for a fetch&add step of
a fetch&add-CRCW PRAM. This extension, Lemma 9 and Theorem 2 imply the following
extension of simulation result 1: A single fetch&add step of n processors on a
fetch&add-CRCW PRAM can be simulated by n/log n arbitrary-CRCW PRAM processors in O(log n)
expected time and O(n) space.

Simulation Result 1 states that a single step of an n-processor min-CRCW can be simulated
with O(n) expected number of operations by an arbitrary-CRCW PRAM. To see what
can be done deterministically we first state the following lemma due to Hagerup:

Lemma 16 ([Hag87]) For any fixed ε > 0, n integers of size polynomial in n can be sorted in
O(log n) time by a priority-CRCW PRAM using O(n log log n/log n) processors and O(n^{1+ε}) space.


Following Lemma 15 and Lemma 16 we have

Simulation Result 2 A single step of n processors on a min-CRCW PRAM with memory
size S can be simulated by n log log n/log n priority-CRCW PRAM processors in O(log n) time and
O(S + n^{1+ε}) space (for any fixed ε > 0).

Comment  3.  Simulation  Result  2  can  be  extended  for  a  fetch&add-CRCW  PRAM  similarly 
to  Comment  2  above. 

Two  more  simulation  results  are  stated  here.  They  are  based  on  [CDHR88],  as  explained 
in  Appendix  B. 

Simulation Result 3 An n-processor min-CRCW that uses memory of size S, where each
memory cell contains a (log m)-bit word, can be simulated by an n-processor arbitrary-CRCW
PRAM with a slowdown of O(log log m), using O(mS) space.

Simulation Result 4 An n-processor min-CRCW that uses memory of size S, where each
memory cell contains a (log m)-bit word, can be simulated by an (n log m)-processor
common-CRCW PRAM in O(1) time, using O(mS) space.
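
The simulation results above relate four concurrent-write conventions. The following is a minimal sequential sketch of how each convention resolves one write step, assuming the standard definitions (min: the smallest value wins; priority: the lowest-numbered processor wins; arbitrary: any one writer wins; common: all writers to a cell must agree). The function name `resolve_writes` and the request encoding are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple

# One write request: (processor id, cell address, value). Hypothetical encoding.
Write = Tuple[int, int, int]


def resolve_writes(requests: List[Write], rule: str) -> Dict[int, int]:
    """Return {cell: stored value} after one concurrent-write step."""
    by_cell: Dict[int, List[Tuple[int, int]]] = {}
    for pid, cell, value in requests:
        by_cell.setdefault(cell, []).append((pid, value))

    result: Dict[int, int] = {}
    for cell, writers in by_cell.items():
        if rule == "min":
            # min-CRCW: the minimum value among all writers is stored.
            result[cell] = min(value for _, value in writers)
        elif rule == "priority":
            # priority-CRCW: the writer with the lowest processor id wins.
            result[cell] = min(writers)[1]
        elif rule == "arbitrary":
            # arbitrary-CRCW: any single writer may win; here, the first listed.
            result[cell] = writers[0][1]
        elif rule == "common":
            # common-CRCW: concurrent writers must all write the same value.
            values = {value for _, value in writers}
            if len(values) != 1:
                raise ValueError(f"conflicting writes to cell {cell}")
            result[cell] = values.pop()
        else:
            raise ValueError(f"unknown rule: {rule}")
    return result
```

For example, with requests `[(0, 5, 9), (1, 5, 3), (2, 5, 7)]` the min rule stores 3 in cell 5, the priority rule stores 9, and the common rule rejects the step.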


4.1  Applications 

In  this  section  we  apply  some  of  the  simulations  to  the  parallel  DNN  and  Sorting  algorithms 
and  derive  complexity  results  for  standard  CRCW  PRAM  models. 

Recall that in algorithm DNN there are O(log log m) phases. Each phase takes O(1) time
for n min-CRCW processors. Following Simulation Results 1, 2, 3 and 4 we have


Corollary  17  Algorithm  DNN  and  its  extension  can  be  implemented  as  follows: 


•  On arbitrary-CRCW in O(log n log log m) expected time, O(n log log m) expected
number of operations and O(n) space.

•  On priority-CRCW in O(log n log log m) time, O(n log log m log log n) operations and
O(m^ε) space, for any fixed ε > 0.

•  On arbitrary-CRCW in O((log log m)^2) time, O(n (log log m)^2) operations and O(m^ε)
space, for any fixed ε > 0.

•  On common-CRCW in O(log log m) time, O(n log m log log m) operations and O(m^ε)
space, for any fixed ε > 0.

Proof: The first implementation is as follows. Recall that the opening sentence of Section 2.4
explains how to trivially extend the DNN algorithm for non-distinct elements using O(nm)
variables. At each phase of algorithm DNN, Theorem 2 is used to map the O(nm) variables
into O(n) space in O(log n) expected time and O(n) expected number of operations. The
step of the min-CRCW PRAM is then done, using Simulation Result 1, in O(log n) expected
time and O(n) expected number of operations, using O(n) space. There are O(log log m)
phases by Lemma 4.

For the other implementations we use the reduced-space DNN algorithm of Lemma 6.
Thus, at each phase we implement a step of a min-CRCW PRAM with O(m^{ε/2}) space, for any
fixed ε > 0. We use Simulation Results 2, 3 and 4 to implement the second, third and fourth
implementations, respectively. Note that the values written by the min-CRCW PRAM are
at most O(m^{ε/2}). ■
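
The time and operation bounds of Corollary 17 follow by multiplying a per-phase cost by the O(log log m) phases. As a worked check for items 1, 3 and 4, using only the per-phase figures quoted in Simulation Results 1, 3 and 4 (T denotes time and W the number of operations; the block assumes the amsmath package):

```latex
\begin{align*}
\text{Item 1 (arbitrary-CRCW):}\quad
  T &= O(\log\log m)\cdot O(\log n) = O(\log n \,\log\log m) \text{ expected},\\
  W &= O(\log\log m)\cdot O(n) = O(n \log\log m) \text{ expected};\\
\text{Item 3 (arbitrary-CRCW):}\quad
  T &= O(\log\log m)\cdot O(\log\log m) = O\bigl((\log\log m)^2\bigr),\\
  W &= n \cdot T = O\bigl(n(\log\log m)^2\bigr);\\
\text{Item 4 (common-CRCW):}\quad
  T &= O(\log\log m)\cdot O(1) = O(\log\log m),\\
  W &= n\log m \cdot T = O(n \log m \,\log\log m).
\end{align*}
```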

Recall that after running algorithm DNN, the Sorting algorithm can be finished by using
the List Ranking procedure. The latter takes O(log n) time, O(n) operations and O(n) space
on an EREW PRAM. Thus, Sorting can be implemented with the same complexities as stated
in Corollary 17 except for an increase in time to Ω(log n) in the last 2 cases.

In  particular,  the  first  item  in  Corollary  17  implies  the  following  parallel  sorting  result. 

Theorem 3 Sorting n integers from the range [1..m] can be done on a randomized
arbitrary-CRCW in O(log n log log m) expected time, O(n log log m) expected number of
operations and O(n) space.


5  Conclusion 


We gave an o(n log n) time randomized algorithm for sorting integers drawn from a
superpolynomial range. Our algorithm takes O(n log log m) expected time and O(n) space. A
parallel version of the algorithm achieves optimal speed up.
Sorting is a fundamental problem in computer science; we therefore expect these results
to have applications in quite a few problems. We outline one possible direction for such
applications.  Consider  a  problem  that  is  defined  on  an  m  by  m  grid  of  points,  and  suppose 
an  algorithm  for  the  problem  needs  sorting  of  points  along  the  x  or  y  axes.  Our  integer 
sorting  algorithm  is  likely  to  be  helpful. 

An open question is whether a space-efficient deterministic Integer Sorting algorithm in
o(n log n) time can be found, for integers drawn from a superpolynomial range [1..n^{polylog(n)}].

The  second  main  topic  is  the  parallel  hashing  technique.  It  achieves  optimal  speed  up  and 
takes  expected  logarithmic  time.  The  parallel  hashing  technique  enables  drastic  reduction 
of  space  requirements  for  the  price  of  using  randomness.  The  technique  was  used  in  the 
parallel  sorting  algorithm  and  in  the  simulation  results;  its  applicability  to  other  problems 
was  demonstrated. 

An open question: design an optimal parallel speed up hashing scheme F : W → [1..O(n)]
that takes sublogarithmic time.

A  third  topic  of  independent  interest  is  the  efficient  simulation  of  strong  models  of  CRCW 
PRAM  by  more  acceptable  ones,  and  the  methodology  of  designing  parallel  algorithms  for 
the  strong  models  with  automatic  simulations  on  acceptable  models.  We  expect  to  see  more 
applications  of  this  approach  in  the  design  of  parallel  algorithms. 
A  Proof  of  Proposition  1


This  appendix  gives  a  proof  of  Proposition  1  for  completeness  of  the  presentation.  We  suggest 
not  to  include  this  appendix  in  a  published  version  of  the  paper. 

Proposition 1 Let x[j] be the left neighbor of x[i]. Then for each level of the recursion the
following properties hold:

(a) The values of all elements in L are distinct and the values of all elements in R are
distinct.

(b) x_l[i] and x_r[j] are both represented in the same recursive sub-problem.

(c) x_r[j] < x_l[i].

(d) x_r[j] is the left neighbor of x_l[i].

(e)  Each  element  is  represented  in  exactly  one  recursive  sub-problem. 

(f)  Set  L  is  nonempty  if  and  only  if  set  R  is  nonempty. 

Proof: 

The  proof  is  by  induction  on  the  level  of  the  recursion. 

Inductive  Base:  Trivial. 

Induction  step: 

(a): After step 2 the values of elements in L_k and in R_k are unchanged and therefore, by
the inductive hypothesis on (a), we only need to look at the sets GL and GR. In step 2, for
each k at most one element from L_k is selected into GL, and at most one element from R_k
is selected into GR. These (possibly) selected elements are the only ones that are assigned
the value k.

(b): By the inductive hypothesis on (b), before step 1 the elements x_l[i] and
x_r[j] are both represented in the same recursive sub-problem. Assume that after step 1
x_l[i] ∈ L_k and x_r[j] ∈ R_{k'} for some k and k'. If x_l[i] was selected into GL, i.e. x_l[i] is
the smallest element in L_k ∪ R_k, then x_r[j] was selected into GR, i.e. x_r[j] is the largest
element in L_{k'} ∪ R_{k'}. Otherwise, we have from the inductive hypothesis on (a) and (c)
that x_r[j] < b_{k'} < x_l[i] (before b_{k'} was changed in set GR), which contradicts the inductive
hypothesis on (d). Similarly, if x_r[j] was selected into GR then x_l[i] was selected into GL;
otherwise we have from the inductive hypothesis on (a) and (c) that x_r[j] < a_k < x_l[i], which
contradicts the inductive hypothesis on (d). If neither x_l[i] nor x_r[j] was selected into GL
and GR, respectively, then k = k'; otherwise, by the inductive hypothesis on (a) and (c), we
have x_r[j] < a_k < x_l[i], which contradicts the inductive hypothesis on (d). Thus, x_l[i] and
x_r[j] are in GL and GR, respectively, or in L_k and R_k, respectively.

(c): By the inductive hypothesis on (b), if x_l[i] and x_r[j] are in L_k and R_k,
respectively, then their values are unchanged. If they are in GL and in GR, respectively,
then after step 2 x_l[i] = k and x_r[j] = k' (where after step 1 x_l[i] ∈ L_k and x_r[j] ∈ R_{k'}).
From the definition of L_k and R_{k'} in step 1 and from the inductive hypothesis on
(c) we have k' < k, that is, x_r[j] < x_l[i] still holds.

(d): The values of all elements in sets L_k and R_k are unchanged. Thus, by
the inductive hypothesis on (b) and (d), if x_l[i] ∈ L_k and x_r[j] ∈ R_k then x_r[j] is the left
neighbor of x_l[i]. We only need to deal with the case that x_l[i] ∈ GL and x_r[j] ∈ GR.
Assume, to the contrary, that there is an element y ∈ GR such that x_r[j] < y < x_l[i]. Clearly,
after step 1 we had x_l[i] ∈ L_k, x_r[j] ∈ R_{k'} and y ∈ R_{k''} with k' < k'' < k, contradicting the
inductive hypothesis on (d).

(e): By the inductive hypothesis on (e), in step (2) each element x_l[i] is either
in L_k for some k or in GL. Similarly, each element x_r[j] is either in R_{k'} for some k' or in
GR. Thus, in step (3) each element is represented in exactly one recursive sub-problem.

(f): Follows from the inductive hypothesis on (b). ■


B  Simulation  Results  3  and  4 


This appendix gives a proof of Simulation Results 3 and 4 for completeness of the
presentation. We suggest not to include this appendix in a published version of the paper.

Consider  some  CRCW  machine.  We  say  that  a  memory  cell  contains  a  processor  if  this 
processor  has  the  address  of  this  cell  written  in  a  special  local  register.  Each  processor 
is  contained  in  at  most  one  cell.  A  cell  that  contains  at  least  one  processor  is  said  to  be 
non-empty,  otherwise  it  is  said  to  be  empty.  The  Find-First  problem  of  size  n  is  defined 
as follows: Given an array A[1..n], each cell of which is empty or contains exactly one
processor, find the lowest-numbered non-empty cell of A. We similarly define the Extended
Find-First problem of size n: Given an array A[1..n], each cell of which is empty or contains
one or more processors, find the lowest-numbered non-empty cell of A.
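
The two problem statements can be pinned down with a sequential reference sketch. It only fixes the input/output specification; it is not the parallel algorithm of [CDHR88]. The name `find_first` and the representation of a cell as a list of processor ids are assumptions made for illustration.

```python
from typing import List, Optional


def find_first(cells: List[List[int]]) -> Optional[int]:
    """Return the 1-based index of the lowest-numbered non-empty cell.

    cells[k] lists the ids of the processors contained in cell A[k + 1].
    With at most one id per cell this is Find-First; with any number of
    ids per cell it is Extended Find-First. Returns None if all are empty.
    """
    for index, contents in enumerate(cells, start=1):
        if contents:
            return index
    return None
```

For instance, `find_first([[], [], [4], [], [1, 7]])` returns 3, since A[3] is the first cell containing a processor.
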
Chlebus, Diks, Hagerup and Radzik [CDHR88] show how to simulate a priority-CRCW
PRAM on weaker CRCW PRAMs. Their simulations are based on solving the Find-First
problem.² Specifically, a 'write' step of an n-processor priority-CRCW is done as follows. For
each memory cell c to which at least one processor attempts to write, a Find-First
problem of size n is defined: A_c[i] contains processor P_i if P_i attempts to write into cell
c. The space required for the simulating machine is O(Sn), where S is the size of memory
being used by the simulated priority-CRCW PRAM. Time requirements are discussed later.

Let min-CRCW(S, m) be a min-CRCW PRAM with space S and m being an upper bound
on the values that can be written into the memory cells. A 'write' step of a min-CRCW(S, m)
PRAM can be done, similarly to the above, by using S Extended Find-First problems of
size m. For each memory cell c to which at least one processor attempts to
write, an Extended Find-First problem of size m is defined: A_c[i] contains processor P_j if P_j
attempts to write i into cell c. The space required for the simulating machine is O(Sm). A
'write' step of the min-CRCW(S, m) PRAM can also be reduced to S Find-First problems
of size nm. Specifically, A_c[⟨i, j⟩] contains processor P_j if P_j attempts to write i into cell
c (here 'lowest number' is with respect to the lexicographic ordering). The space required
for the simulating machine here is O(Snm).
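
The first reduction above can be sketched sequentially as follows. This is an illustration of the reduction only, not of the parallel Find-First algorithms; the names `min_crcw_write` and `find_first`, the request encoding, and the assumption that written values lie in 1..m are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_first(cells: List[List[int]]) -> Optional[int]:
    """1-based index of the lowest-numbered non-empty cell, or None."""
    for index, contents in enumerate(cells, start=1):
        if contents:
            return index
    return None


def min_crcw_write(requests: List[Tuple[int, int, int]], m: int) -> Dict[int, int]:
    """Simulate one min-CRCW write step via Extended Find-First.

    Each request is (j, c, i): processor P_j attempts to write value i
    into cell c, with 1 <= i <= m (assumed value range).
    """
    arrays: Dict[int, List[List[int]]] = {}
    for j, c, i in requests:
        if not 1 <= i <= m:
            raise ValueError(f"value {i} outside 1..{m}")
        # One array A_c of size m per written cell: A_c[i] contains P_j.
        a_c = arrays.setdefault(c, [[] for _ in range(m)])
        a_c[i - 1].append(j)
    # The lowest-numbered non-empty cell of A_c is the minimum value
    # written to c, which is exactly what min-CRCW stores there.
    return {c: find_first(a_c) for c, a_c in arrays.items()}
```

Allocating m slots for every written cell mirrors the O(Sm) space bound stated in the text; several processors writing the same value share one slot, which is why the extended form of the problem is needed.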

Chlebus  et  al.  show  how  to  solve  the  Find-First  problem  of  size  m  on  several  machines: 


Proposition 18 ([CDHR88]) The Find-First problem of size m can be solved as follows:
(a) On an arbitrary-CRCW PRAM in O(log log m) time; (b) On a common-CRCW PRAM in
constant time, provided that each processor has an additional log m processors.


The algorithms in [CDHR88] also solve the Extended Find-First problem, since all
processors contained in the same cell act identically (independently of their index). This is
enough when using the arbitrary-CRCW and the common-CRCW PRAMs. Using the above we
have

Simulation Result 3 An n-processor min-CRCW(S, m) PRAM can be simulated by an
n-processor arbitrary-CRCW PRAM with O(log log m) slowdown, using O(mS) space.

Simulation Result 4 An n-processor min-CRCW(S, m) PRAM can be simulated by an
(n log m)-processor common-CRCW PRAM in O(1) time, using O(mS) space.


² Fich, Ragde and Wigderson [FRW88] also used the Find-First problem (there called the "leftmost prisoner
problem") to simulate priority-CRCW on a common-CRCW PRAM.